Goto

Collaborating Authors

 optimal bias function


Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Neural Information Processing Systems

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored.





Reviews: Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Neural Information Processing Systems

The paper focuses on the important problem of designing optimal algorithms for exploration-exploitation (whose upper-bound matches the lower bound). The paper is not well organized and written. It is difficult to abstract from the mathematical formulation and grasps the key ideas behind the improvement of the regret bound. As far as I understood, the first important component in improving the bound is to use variance dependent confidence intervals (ie Bernstein). Together with the knowledge of H, this allows designing a tighter optimism (Eq.


Reviews: Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Neural Information Processing Systems

This paper has lead to a long and thoughtful discussion between the reviewers. The main points that were raised are the following: The results are novel and close a long-standing gap between upper and lower bounds in a very important problem. While the reviewers have agreed that the results are significant and they definitely bring the field forward, an expert reviewer argued that the step forward is perhaps not significantly big enough to warrant publication in the present form. However, after much discussion, the other reviewers made a strong case for acceptance and all reviewers agreed that the community would clearly benefit from this paper being published. That said, I strongly encourage the authors to work hard on improving the presentation for the final version.


Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Neural Information Processing Systems

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function h {*}, the proposed algorithm achieves a regret bound of \tilde{O}(\sqrt{SATH}) \footnote{The symbol \tilde{O} means O with log factors ignored. Furthermore, this regret bound matches the lower bound of \Omega(\sqrt{SATH}) \cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of \tilde{O}(\sqrt{DSAT}) for MDPs with finite diameter D compared to the lower bound of \Omega(\sqrt{DSAT}) \cite{jaksch2010near}.


Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Neural Information Processing Systems

We present an algorithm based on the \emph{Optimism in the Face of Uncertainty} (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h {*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SATH})$\footnote{The symbol $\tilde{O}$ means $O$ with log factors ignored. This result outperforms the best previous regret bounds $\tilde{O}(HS\sqrt{AT})$\cite{bartlett2009regal} by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SATH})$\cite{jaksch2010near} up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\tilde{O}(\sqrt{DSAT})$ for MDPs with finite diameter $D$ compared to the lower bound of $\Omega(\sqrt{DSAT})$\cite{jaksch2010near}.


Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

arXiv.org Machine Learning

We present an algorithm based on the Optimism in the Face of Uncertainty (OFU) principle which is able to learn Reinforcement Learning (RL) modeled by Markov decision process (MDP) with finite state-action space efficiently. By evaluating the state-pair difference of the optimal bias function $h^{*}$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$for MDP with $S$ states and $A$ actions, in the case that an upper bound $H$ on the span of $h^{*}$, i.e., $sp(h^{*})$ is known. This result outperforms the best previous regret bounds $\tilde{O}(HS\sqrt{AT})$ [Bartlett and Tewari, 2009] by a factor of $\sqrt{SH}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ [Jaksch et al., 2010] up to a logarithmic factor. As a consequence, we show that there is a near optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with finite diameter $D$ compared to the lower bound of $\Omega(\sqrt{SADT})$ [Jaksch et al., 2010].